Note: In this document, all R code is hidden by default to improve readability. You can reveal individual code chunks by clicking “Show” or display all code at once by clicking the “Code” button at the top right.

Introduction

The purpose of this project is to demonstrate the implementation and evaluation of vertical scaling for educational assessment. We’ll be expanding on examples provided in the documentation for the plink package.

This notebook will cover:

  1. IRT common-item vertical scaling for two groups (Example 1) and six groups (Example 2).
  2. Simulating response data for multi-group (Example 1) and longitudinal (Example 2) designs.
  3. Evaluating vertical scaling results.
  4. Comparing different scaling techniques for score reporting.
  5. Conducting simple growth modeling on longitudinal scaled scores.

Vertical scaling is a statistical technique used in educational testing to create a common scale for measuring student performance across different grade levels or age groups. This approach allows educators and researchers to track and compare student progress over time, ensuring that test scores are interpretable and comparable across grades.

A few key concepts related to Vertical Scaling:

The two examples in this project use common-item vertical scaling of unidimensional item parameters. The first example is a simple two-group example, and the second is a more complex example of scaling spanning six grades (Grades 3-8). In a common-item scaling design, each grade takes a separate test, with adjacent grades sharing common items. The common items are used to form a linking chain. For instance, in our six-grade example, the lowest grade (Grade 3) serves as the base scale; the Grade 4 items are linked to the base scale using common items, the Grade 5 items are linked to the now-transformed Grade 4 items, and so on.
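To make the linking chain concrete, here is a minimal base-R sketch (with hypothetical slope/intercept values; in practice each pair of constants would come from the linking procedure) showing that two chained linear transformations collapse into a single composite transformation onto the base scale:

```r
# Hypothetical linking constants for two adjacent-grade pairs
link_g4_to_g3 <- c(A = 1.10, B = 0.40)  # places Grade 4 scale onto Grade 3 (base)
link_g5_to_g4 <- c(A = 1.05, B = 0.35)  # places Grade 5 scale onto Grade 4

# Linear rescaling: theta' = A * theta + B
rescale_theta <- function(theta, link) unname(link["A"] * theta + link["B"])

# Chain the links: Grade 5 theta -> Grade 4 scale -> Grade 3 base scale
theta_g5      <- 0.25
theta_on_base <- rescale_theta(rescale_theta(theta_g5, link_g5_to_g4),
                               link_g4_to_g3)

# The chain is equivalent to one composite linear transformation
A_comp <- unname(link_g4_to_g3["A"] * link_g5_to_g4["A"])
B_comp <- unname(link_g4_to_g3["A"] * link_g5_to_g4["B"] + link_g4_to_g3["B"])
```

Grades further up the chain simply accumulate more links; the composite slope and intercept summarize the whole path back to the base scale.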

There are other approaches to vertical scaling, such as a scaling test design. In this design, a ‘scaling test’ spanning the entire breadth of content for all grades is administered to a representative sample of students with participation from each grade. Then everyone in each grade is also given a ‘level test’ for their grade. This scaling test is used to place level test scores along the developmental scale.

For additional reading on vertical scaling, I recommend:

Prepare Workspace

First we’ll load the relevant packages, including the custom CATFunctions package to support MLE estimation, and create a function that uses parallel processing for efficient ability estimation during simulation.

library(plink)     # Links tests
library(psych)     # Descriptive Stats
library(janitor)   # Clean up data
library(tidyverse) # Keep things tidy
library(mirt)      # IRT Modeling
library(parallel)  # To parallelize the est_abilities_parallel function
library(ggthemes)  # for theme_clean()
library(purrr)     # Functional programming
library(lme4)      # Growth curve modeling
library(nlme)      # For estimating linear models w/ MLE
library(CATFunctions) # devtools::install_github("scottfrohn/CATFunctions")
options(digits = 3)

# Parallel processing function: estimates abilities for all cases in a
# dataframe of item responses
est_abilities_parallel <- function(
  response_df,  # dataframe of item responses
  as,           # a params for each item in the response_df
  bs,           # b params for each item in the response_df
  cs            # c params for each item in the response_df
) {
  # Split response_df into a list of one-row responses, one per case
  responses_list <- split(response_df, seq(nrow(response_df)))
  # Use all but one core
  num_cores <- detectCores() - 1
  # Make a cluster for the selected number of cores
  cl <- makeCluster(num_cores)
  # Export the functions and variables the workers need
  clusterExport(cl,
                c("est_ability_mle", "prob", "check_lengths", "as", "bs", "cs"),
                envir = environment())
  # Estimate ability for all responses across the cluster
  abilities <- parLapply(cl, responses_list, function(responses) {
    est_ability_mle(responses, as, bs, cs)
  })
  stopCluster(cl)
  # Combine the per-case results into a single data frame
  abilities <- do.call(bind_rows, lapply(abilities, function(x) {
    data.frame(ability_est = x$ability_est, ability_est_se = x$ability_est_se)
  }))
  abilities
}

Example 1: Two Groups

The Dataset

KB04 / Kolen - Brennan (2004): “This dataset includes three parameter logistic model (3PL) item parameter estimates for two forms of a unidimensional test for two groups. Both forms were calibrated separately. The Form X items are from a ‘new’ form and the Form Y items are from an ‘old’ form. The data are listed in Table 6.5 in Test Equating, Scaling, and Linking (2nd ed.). There are 12 common items between the two forms.” - Weeks (2010)

Note the same data are listed in the same table in Kolen and Brennan (2014) (the 3rd edition).

We can load the KB04 dataset from plink. Terms I’ll use to describe the data herein are:

  • Form X: the first form
  • Group X: the simulated group who took Form X.
  • Form Y: the second form
  • Group Y: the simulated group who took Form Y.
  • Sims: simulated respondents

Form X

KB04 <- KB04  # Touch the lazy-loaded dataset so it appears in the environment

KB04$pars$form.x
##    discrimination difficulty asymptote
## 1           0.550    -1.7960    0.1751
## 2           0.789    -0.4796    0.1165
## 3           0.455    -0.7101    0.2087
## 4           1.444     0.4833    0.2826
## 5           0.974    -0.1680    0.2625
## 6           0.584    -0.8567    0.2038
## 7           0.860     0.4546    0.3224
## 8           1.145    -0.1301    0.2209
## 9           0.754     0.0212    0.1600
## 10          0.917     1.0139    0.3648
## 11          0.959     0.7218    0.2399
## 12          0.663     0.0506    0.1240
## 13          1.232     0.4167    0.2535
## 14          1.049     0.7882    0.1569
## 15          1.069     0.9610    0.2986
## 16          0.919     0.6099    0.2521
## 17          0.893     0.5128    0.2273
## 18          0.967     0.1950    0.0535
## 19          0.656     0.3953    0.1201
## 20          1.056     0.9481    0.2036
## 21          0.348     2.2768    0.1489
## 22          0.843     1.0601    0.2332
## 23          1.114     0.5826    0.0644
## 24          1.458     1.0241    0.2453
## 25          0.514     1.3790    0.1427
## 26          0.919     1.0782    0.0879
## 27          1.881     1.4062    0.1992
## 28          1.504     1.5093    0.1642
## 29          0.966     1.5443    0.1431
## 30          0.702     2.2401    0.0853
## 31          1.265     1.8759    0.2443
## 32          0.857     1.7140    0.0865
## 33          1.408     1.5556    0.0789
## 34          0.581     3.4728    0.1399
## 35          0.926     3.1202    0.1090
## 36          1.299     2.1589    0.1075

This is the unaltered KB04$pars$form.x table from our dataset.

Form Y

KB04$pars$form.y
##    discrimination difficulty asymptote
## 37          0.870    -1.4507    0.1576
## 38          0.463    -0.4070    0.1094
## 39          0.442    -1.3349    0.1559
## 40          0.545    -0.9017    0.1381
## 41          0.620    -1.4865    0.2114
## 42          0.573    -1.3210    0.1913
## 43          1.175     0.0691    0.2947
## 44          0.445     0.2324    0.2723
## 45          0.599    -0.7098    0.1177
## 46          0.848    -0.4253    0.1445
## 47          1.032    -0.8184    0.0936
## 48          0.604    -0.3539    0.0818
## 49          0.830    -0.0191    0.1283
## 50          0.725    -0.3155    0.0854
## 51          0.990     0.5320    0.3024
## 52          0.775     0.5394    0.2179
## 53          0.594     0.8987    0.2299
## 54          0.808    -0.1156    0.0648
## 55          0.964    -0.1948    0.1633
## 56          0.784     0.3506    0.1299
## 57          0.414     2.5538    0.2410
## 58          0.762    -0.1581    0.1137
## 59          1.196     0.5056    0.2397
## 60          1.355     0.5811    0.2243
## 61          1.187     0.6229    0.2577
## 62          1.030     0.3898    0.1856
## 63          1.042     0.9392    0.1651
## 64          1.206     1.1350    0.2323
## 65          0.970     0.6976    0.1070
## 66          0.634     1.8960    0.0794
## 67          1.082     1.3864    0.1855
## 68          1.020     0.9197    0.1027
## 69          1.135     1.0790    0.0630
## 70          1.195     1.8411    0.0999
## 71          1.196     2.0297    0.0832
## 72          0.925     2.1337    0.1259

This is the unaltered KB04$pars$form.y table from our dataset. Note the rows are labeled 37:72 (instead of 1:36).

Common Items

KB04$common
##       form.x form.y
##  [1,]      3      3
##  [2,]      6      6
##  [3,]      9      9
##  [4,]     12     12
##  [5,]     15     15
##  [6,]     18     18
##  [7,]     21     21
##  [8,]     24     24
##  [9,]     27     27
## [10,]     30     30
## [11,]     33     33
## [12,]     36     36

This is the unaltered KB04$common table from our dataset. Note the numbers under form.y reference the row index, not the row label. The ‘3’ here therefore refers to the item labeled ‘39’ in the KB04$pars$form.y table. The scaling functions use the indices, not the row labels.
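To see the index-versus-label distinction without the scaling machinery, here is a toy data frame with the same row labeling as KB04$pars$form.y:

```r
# Toy stand-in for KB04$pars$form.y: 36 rows labeled 37 through 72
toy_form_y <- data.frame(difficulty = seq(-1, 1, length.out = 36),
                         row.names  = 37:72)

# Index 3 (what the common-item matrix stores) points at the row *labeled* "39"
rownames(toy_form_y)[3]  # "39"
```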

Item Evaluation

Let’s take a look at the ICCs of our common items and see how they compare from the separate calibrations.

# Named vector to change the default item names generated by item_ICC_bank (row names)
recode_vector <- setNames(KB04$common[,"form.x"], c(1:12))

# Table that contains points for ICCs
common.pars <- KB04$pars$form.x[KB04$common[,"form.x"],] %>%
  rename_with(., ~ c("a","b","c")) %>% 
  item_ICC_bank %>%
  mutate(form = "x") %>%
  bind_rows(
    KB04$pars$form.y[KB04$common[,"form.y"],] %>%
      rename_with(., ~ c("a","b","c")) %>% 
      item_ICC_bank %>%
      mutate(form = "y")
) %>% 
  mutate(item = recode(item,
                       !!!recode_vector))

# Faceted ICC plots
common.pars %>% 
  ggplot(., aes(x = thetas, y = icc, color = form)) +
  geom_line() +
  facet_wrap(~ item) +
  scale_color_manual(values = c("x" = "green3", "y" = "blue3")) +
  labs(
    title = "Item Characteristic Curves",
    x = expression(theta),
    y = "Probability of Correct Response"
  ) +
  theme_clean() +
  theme(legend.position = "bottom")

Item 27 stands out, performing considerably differently across the two groups. We could perform analyses to quantify the extent of this differential item functioning (DIF), but that’s beyond the scope of this project. We’ll simply assume item 27 demonstrates unacceptable DIF and leave it out of our scaling.

Vertical Scaling

Before we conduct vertical scaling, there are a few decisions we need to make. For instance:

  1. Which method should we use to transform (rescale) the IRT parameters? There are a few options, some using the means and standard deviations of the common items (Mean/Mean "MM", Mean/Sigma "MS"), and some using item characteristic curve transformations (Haebara "HB", Stocking-Lord "SL"). Each of these methods produces Slope (A) and Intercept (B) parameters to linearly transform item parameters from Form X to the scale of Form Y. Let’s go with Mean/Sigma.
  2. What base group (base.grp) should we use? Let’s use Form Y (base.grp = 2) for the base scale, transforming Form X item parameters to the Form Y scale. The base group number is defined by the order in which the group appears in the irt.pars class object we pass into the model (for us it’s KB04$pars).
  3. Do we want to use a scaling constant (D)? Using a constant like D = 1.7 would transform our results to the normal ogive model. For us, since the KB04 items were calibrated with BILOG-MG using the default 1.7 scaling constant, we’ll use the constant also.
  4. Do we need to exclude any of the common items? Based on our evaluation of the common item ICCs from above, let’s drop item 27 (exclude = list(27, NA)).

# Create irt.pars object with two groups (all dichotomous items)
kb04.pm <- as.poly.mod(36)  # Create a poly.mod object with 36 items
                            # Default model = "drm" for dichotomous responses model
                            # All items are of the same model family (drm)

kb04.irt.pars <- as.irt.pars(       # Turn values into an `irt.pars` object
  x = KB04$pars,                    # List of item parameter sets, one per group
  common = KB04$common,             # Matrix of common items between the groups
  cat = list(rep(2,36),rep(2,36)),  # Number of response categories for all 36 
                                    # items in the datasets; two response cats
  poly.mod = list(kb04.pm, kb04.pm) # Lists the poly.mod objects that define 
                                    # the models for each item (all are the 
                                    # same, 36 'drm' items)
                             )

# rescale the item parameters using the Mean/Sigma linking constants,
# and exclude item 27 from the common item set
kb04.out <- plink(kb04.irt.pars, 
             rescale = "MS",        # Transform parameters to the base scale, using the Mean/Sigma (Mean/SD) method
             base.grp = 2,          # The base scale is group 2 (form.y)
             D = 1.7,               # The scaling constant for the logistic model; D=1.7 approximates the normal ogive model
             exclude = list(27,NA)) # Excludes item 27 from the linking process

# And get the new item parameters. 
kb04.pars.out <- link.pars(kb04.out)

Scaling Summary

Now let’s examine a summary of the scaling.

summary(kb04.out, descrip = TRUE)
## -------  group1/group2*  -------
## Linking Constants
## 
##                      A         B
## Mean/Mean     1.144959 -0.478967
## Mean/Sigma    1.176074 -0.504188
## Haebara       1.088361 -0.441590
## Stocking-Lord 1.097363 -0.464035
## 
## Common Item Descriptive Statistics
## 
## Model: 3PL 
## Number of Items: 11 
## 
##                    a         b         c
## N Pars:    11.000000 11.000000 11.000000
## Mean: To    0.770809  0.449127  0.149773
## Mean: From  0.882545  0.810591  0.155864
## SD: To      0.285828  1.293511  0.076785
## SD: From    0.366505  1.099855  0.072778

What this output tells us:

  1. The 11 common items were all 3PL-calibrated (we had 12, but dropped item 27).

  2. The Slope (A) and Intercept (B) parameters for the different linear scaling methods are presented. Since we used rescale = 'MS', our parameters are in the row marked Mean/Sigma; these can be used to rescale the 3PL item parameters for Form X (since Form Y was our base form). These parameters are:

  • A (slope) = 1.176074
  • B (intercept) = -0.504188

  3. Because these are the linear differences in the common items, we can loosely interpret the intercept (about -0.50) to indicate that the group that took Form X is, on average, about half a standard deviation lower in ability than the group that took Form Y.

  4. The last table shows the descriptive stats of the common item parameters for Form Y (To) and Form X (From). Note the SDs reported here are population SDs, so if you’re checking the calculations, remember that R’s sd() function computes the sample SD and will be slightly off.
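As a check on our understanding (not part of the plink workflow), the Mean/Sigma constants can be reproduced by hand from the common-item descriptive statistics reported above: A is the ratio of the b-parameter SDs (To over From), and B aligns the means.

```r
# Common-item b-parameter statistics as reported in the scaling summary
mean_to   <- 0.449127   # Form Y (base scale), "To"
mean_from <- 0.810591   # Form X, "From"
sd_to     <- 1.293511   # population SDs, as computed by plink
sd_from   <- 1.099855

A <- sd_to / sd_from           # slope
B <- mean_to - A * mean_from   # intercept
round(c(A = A, B = B), 6)      # matches the Mean/Sigma row above

# If recomputing the SDs from raw b-parameters, use the population SD:
pop_sd <- function(x) sd(x) * sqrt((length(x) - 1) / length(x))
```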

Verify Scaling Results

Let’s check those item parameters from the final scaling object (kb04.out). Note that Form Y (group2) parameters should be the same, and Form X (group1) parameters will be different.

Form Y

Since they shouldn’t have changed, let’s check Form Y first.

# Pull out the form y parameters
(form.y.params <- KB04$pars$form.y %>%  # 1. Original parameters, "before" scaling
  rename(a.before = "discrimination",
         b.before = "difficulty",
         c.before = "asymptote") %>%
  bind_cols(kb04.pars.out$group2 %>%    # 2. New parameters, "after" scaling"
              as.data.frame() %>%
              setNames(c("a.after","b.after","c.after"))) %>%
   mutate(item = seq(1:nrow(kb04.pars.out$group2)),
          common = ifelse(item %in% data.frame(kb04.out$pars@common)[,2],
                          1,0)) %>%
   relocate(item, common, a.before, a.after, b.before, b.after, c.before, c.after)
)
##    item common a.before a.after b.before b.after c.before c.after
## 37    1      0    0.870   0.870  -1.4507 -1.4507   0.1576  0.1576
## 38    2      0    0.463   0.463  -0.4070 -0.4070   0.1094  0.1094
## 39    3      1    0.442   0.442  -1.3349 -1.3349   0.1559  0.1559
## 40    4      0    0.545   0.545  -0.9017 -0.9017   0.1381  0.1381
## 41    5      0    0.620   0.620  -1.4865 -1.4865   0.2114  0.2114
## 42    6      1    0.573   0.573  -1.3210 -1.3210   0.1913  0.1913
## 43    7      0    1.175   1.175   0.0691  0.0691   0.2947  0.2947
## 44    8      0    0.445   0.445   0.2324  0.2324   0.2723  0.2723
## 45    9      1    0.599   0.599  -0.7098 -0.7098   0.1177  0.1177
## 46   10      0    0.848   0.848  -0.4253 -0.4253   0.1445  0.1445
## 47   11      0    1.032   1.032  -0.8184 -0.8184   0.0936  0.0936
## 48   12      1    0.604   0.604  -0.3539 -0.3539   0.0818  0.0818
## 49   13      0    0.830   0.830  -0.0191 -0.0191   0.1283  0.1283
## 50   14      0    0.725   0.725  -0.3155 -0.3155   0.0854  0.0854
## 51   15      1    0.990   0.990   0.5320  0.5320   0.3024  0.3024
## 52   16      0    0.775   0.775   0.5394  0.5394   0.2179  0.2179
## 53   17      0    0.594   0.594   0.8987  0.8987   0.2299  0.2299
## 54   18      1    0.808   0.808  -0.1156 -0.1156   0.0648  0.0648
## 55   19      0    0.964   0.964  -0.1948 -0.1948   0.1633  0.1633
## 56   20      0    0.784   0.784   0.3506  0.3506   0.1299  0.1299
## 57   21      1    0.414   0.414   2.5538  2.5538   0.2410  0.2410
## 58   22      0    0.762   0.762  -0.1581 -0.1581   0.1137  0.1137
## 59   23      0    1.196   1.196   0.5056  0.5056   0.2397  0.2397
## 60   24      1    1.355   1.355   0.5811  0.5811   0.2243  0.2243
## 61   25      0    1.187   1.187   0.6229  0.6229   0.2577  0.2577
## 62   26      0    1.030   1.030   0.3898  0.3898   0.1856  0.1856
## 63   27      0    1.042   1.042   0.9392  0.9392   0.1651  0.1651
## 64   28      0    1.206   1.206   1.1350  1.1350   0.2323  0.2323
## 65   29      0    0.970   0.970   0.6976  0.6976   0.1070  0.1070
## 66   30      1    0.634   0.634   1.8960  1.8960   0.0794  0.0794
## 67   31      0    1.082   1.082   1.3864  1.3864   0.1855  0.1855
## 68   32      0    1.020   1.020   0.9197  0.9197   0.1027  0.1027
## 69   33      1    1.135   1.135   1.0790  1.0790   0.0630  0.0630
## 70   34      0    1.195   1.195   1.8411  1.8411   0.0999  0.0999
## 71   35      0    1.196   1.196   2.0297  2.0297   0.0832  0.0832
## 72   36      1    0.925   0.925   2.1337  2.1337   0.1259  0.1259
# Check the number of items that have different before and after parameters 
form.y.params %>%
  mutate(a.diff = a.before - a.after, # Difference between the params before and after scaling
         b.diff = b.before - b.after,
         c.diff = c.before - c.after) %>%
  select(a.diff, b.diff, c.diff) %>%
  summarise(a_differences = sum(a.diff != 0),
            b_differences = sum(b.diff != 0),
            c_differences = sum(c.diff != 0)) %>% 
  t() %>% data.frame("N.items" = .)
##               N.items
## a_differences       0
## b_differences       0
## c_differences       0

The first table shows a side-by-side comparison of parameters ‘before’ and ‘after’ scaling for each item. Looks like nothing has changed (which is what we expect). The second (custom) table provides a summary of the number of different parameters, which we can confirm is 0 for a, b, and c. Looks good, let’s check Form X now.

Form X

So let’s use those linear scaling parameters to manually rescale the original Form X parameters and compare against the output from plink.

Note that the transformation equation differs for each parameter type:

  1. Discrimination (a): a’ = a / slope
  2. Difficulty (b): b’ = b * slope + intercept
  3. Asymptote (c): c’ = c
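A minimal sketch of these equations, along with a check of why they work: an item rescaled this way produces the same response probability as the original, provided theta is moved to the new scale the same way (theta' = A * theta + B). The item values below are Form X item 9 (a common item) from the table above; any scaling constant D is absorbed into a here.

```r
# Apply linear linking constants (slope A, intercept B) to 3PL parameters
rescale_pars <- function(a, b, c, A, B) list(a = a / A, b = A * b + B, c = c)

# 3PL response probability (any scaling constant D absorbed into a)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

A <- 1.176074; B <- -0.504188          # Mean/Sigma constants from the summary
a <- 0.754; b <- 0.0212; c <- 0.1600   # Form X item 9
new <- rescale_pars(a, b, c, A, B)

# Same curve, relocated scale: P(theta) with old pars == P(A*theta + B) with new pars
theta <- 0.5
p_old <- p3pl(theta, a, b, c)
p_new <- p3pl(A * theta + B, new$a, new$b, new$c)
all.equal(p_old, p_new)  # TRUE
```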

# Get the slope and intercept from the scaling output. Changing '$MS' to
# '$MM', '$HB', or '$SL' would access the constants for the other three methods.
x.slope <- kb04.out$link@constants$MS[1]
x.intercept <- kb04.out$link@constants$MS[2]

(form.x.params <- KB04$pars$form.x %>%
  rename(a.before = "discrimination",
         b.before = "difficulty",
         c.before = "asymptote") %>%
  bind_cols(kb04.pars.out$group1 %>%
              as.data.frame() %>%
              setNames(c("a.after","b.after","c.after"))) %>%
   mutate(item = seq(1:nrow(kb04.pars.out$group1)),
          common = ifelse(item %in% data.frame(kb04.out$pars@common)[,1],
                          1,0)) %>%
   relocate(item, common, a.before, a.after, b.before, b.after, c.before, c.after)
)
##    item common a.before a.after b.before b.after c.before c.after
## 1     1      0    0.550   0.467  -1.7960 -2.6164   0.1751  0.1751
## 2     2      0    0.789   0.671  -0.4796 -1.0682   0.1165  0.1165
## 3     3      1    0.455   0.387  -0.7101 -1.3393   0.2087  0.2087
## 4     4      0    1.444   1.228   0.4833  0.0642   0.2826  0.2826
## 5     5      0    0.974   0.828  -0.1680 -0.7018   0.2625  0.2625
## 6     6      1    0.584   0.496  -0.8567 -1.5117   0.2038  0.2038
## 7     7      0    0.860   0.732   0.4546  0.0305   0.3224  0.3224
## 8     8      0    1.145   0.973  -0.1301 -0.6572   0.2209  0.2209
## 9     9      1    0.754   0.641   0.0212 -0.4793   0.1600  0.1600
## 10   10      0    0.917   0.780   1.0139  0.6882   0.3648  0.3648
## 11   11      0    0.959   0.816   0.7218  0.3447   0.2399  0.2399
## 12   12      1    0.663   0.564   0.0506 -0.4447   0.1240  0.1240
## 13   13      0    1.232   1.048   0.4167 -0.0141   0.2535  0.2535
## 14   14      0    1.049   0.892   0.7882  0.4228   0.1569  0.1569
## 15   15      1    1.069   0.909   0.9610  0.6260   0.2986  0.2986
## 16   16      0    0.919   0.782   0.6099  0.2131   0.2521  0.2521
## 17   17      0    0.893   0.760   0.5128  0.0989   0.2273  0.2273
## 18   18      1    0.967   0.822   0.1950 -0.2749   0.0535  0.0535
## 19   19      0    0.656   0.558   0.3953 -0.0393   0.1201  0.1201
## 20   20      0    1.056   0.898   0.9481  0.6108   0.2036  0.2036
## 21   21      1    0.348   0.296   2.2768  2.1735   0.1489  0.1489
## 22   22      0    0.843   0.717   1.0601  0.7426   0.2332  0.2332
## 23   23      0    1.114   0.947   0.5826  0.1810   0.0644  0.0644
## 24   24      1    1.458   1.240   1.0241  0.7002   0.2453  0.2453
## 25   25      0    0.514   0.437   1.3790  1.1176   0.1427  0.1427
## 26   26      0    0.919   0.782   1.0782  0.7639   0.0879  0.0879
## 27   27      0    1.881   1.599   1.4062  1.1496   0.1992  0.1992
## 28   28      0    1.504   1.279   1.5093  1.2709   0.1642  0.1642
## 29   29      0    0.966   0.822   1.5443  1.3120   0.1431  0.1431
## 30   30      1    0.702   0.597   2.2401  2.1303   0.0853  0.0853
## 31   31      0    1.265   1.076   1.8759  1.7020   0.2443  0.2443
## 32   32      0    0.857   0.728   1.7140  1.5116   0.0865  0.0865
## 33   33      1    1.408   1.197   1.5556  1.3253   0.0789  0.0789
## 34   34      0    0.581   0.494   3.4728  3.5801   0.1399  0.1399
## 35   35      0    0.926   0.787   3.1202  3.1654   0.1090  0.1090
## 36   36      1    1.299   1.105   2.1589  2.0348   0.1075  0.1075
form.x.params %>%
  mutate(a.calc = a.before / x.slope,
         b.calc = b.before * x.slope + x.intercept,
         c.calc = c.before,
         a.diff = a.before - a.after, # Difference between the params before and after scaling
         b.diff = b.before - b.after,
         c.diff = c.before - c.after,
         a.calc.diff = a.calc - a.after,
         b.calc.diff = b.calc - b.after,
         c.calc.diff = c.calc - c.after) %>%
  select(a.diff, b.diff, c.diff, a.calc.diff, b.calc.diff, c.calc.diff) %>%
  summarise(a_differences = sum(a.diff != 0),
            b_differences = sum(b.diff != 0),
            c_differences = sum(c.diff != 0),
            a_calc_diffs = sum(a.calc.diff != 0),
            b_calc_diffs = sum(b.calc.diff != 0),
            c_calc_diffs = sum(c.calc.diff != 0)) %>% 
  t() %>% data.frame("N.items" = .)
##               N.items
## a_differences      36
## b_differences      36
## c_differences       0
## a_calc_diffs        0
## b_calc_diffs        0
## c_calc_diffs        0

Again, the first table shows a side-by-side comparison of parameters ‘before’ and ‘after’ scaling for each item. Looks like the a’s and b’s have changed, but c’s have not (as expected).

The second table shows that:

  • All (36) of the a parameters are different after scaling.
  • All (36) of the b parameters are different after scaling.
  • None of the c parameters are different after scaling.
  • None of the calculated a, b, and c parameters are different than the plink scaled ones. Our calculated parameters using the “Mean/Sigma” values produce the same results.

Vertical Scaling in Practice

From this point, we can use the updated item parameters and linear scaling parameters for our vertical scale moving forward. This means:

  • All test-takers who take Form X will have abilities estimated with the new (scaled) Form X item parameters.
  • All test-takers who take Form Y will have abilities estimated with the Form Y item parameters
  • All new items for Form X will need to be calibrated to the original Form X scale (using existing Form X items as anchors), then transformed using the linear parameters (A = 1.176074; B = -0.504188).
  • All new items for Form Y will need to be calibrated to the Form Y scale (using existing Form Y items as anchors).

Simulate Data

True (Latent) Abilities

Now that we’ve confirmed the item parameter scaling results, let’s simulate a sample of 1000 latent abilities for two groups and check the impact of scaling. We’ll simulate with a normal distribution for both groups.

# 1. Simulate ability levels
n_test_takers <- 1000

# Group x is lower ability (this will be evident when creating response patterns, no need to change the distribution)
set.seed(123)
theta.x <- rnorm(n_test_takers)
# Group y is higher ability
set.seed(234)
theta.y <- rnorm(n_test_takers)

describe(theta.x) %>%
  bind_rows(describe(theta.y)) %>%
  as_tibble %>%
  mutate(form = c("x","y")) %>%
  relocate(form)
## # A tibble: 2 × 14
##   form   vars     n    mean    sd  median  trimmed   mad   min   max range
##   <chr> <dbl> <dbl>   <dbl> <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x         1  1000  0.0161 0.992 0.00921  0.00896 0.964 -2.81  3.24  6.05
## 2 y         1  1000 -0.0216 0.985 0.0257  -0.00949 0.957 -3.04  3.10  6.13
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>

Here are descriptive statistics of the simulated abilities for Group X and Group Y. Note that although both groups are drawn from the same distribution, our vertical scaling will ultimately show that Group Y has higher ability than Group X, so we don’t need to build a mean difference into the simulation. This is because:

  • Forms X and Y were calibrated separately.
  • The common items reflect that Form Y is more difficult than Form X.
  • Simulating responses based on latent abilities with the same distribution will result in Group X sims performing worse on common items than Group Y sims.
  • After placing Form X and Form Y on the same developmental scale, final ability estimates for Group X will be on average lower than Group Y.

Response Patterns

Now that we have an underlying distribution of latent abilities, we can simulate response patterns for these sims. The CATFunctions::generate_responses function is designed to simulate item responses under the 3PL IRT model. It calculates the probability that a sim with a given theta answers an item correctly, then simulates a response for that item from the binomial distribution (using rbinom()). The probability \(P_i(\theta)\) of a person with ability \(\theta\) getting item \(i\) correct, given item parameters \(a_i\), \(b_i\), and \(c_i\), is as follows:

\[ P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp(-a_i (\theta - b_i))} \]

Where:

  • \(a_i\) is the discrimination parameter for item \(i\),
  • \(b_i\) is the difficulty parameter for item \(i\), and
  • \(c_i\) is the guessing parameter for item \(i\).
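As a quick sanity check on this formula: at \(\theta = b_i\) the exponent is zero, the logistic term equals 1/2, and the probability reduces to \(c_i + (1 - c_i)/2\), i.e., halfway between the guessing floor and 1, regardless of the discrimination.

```r
# 3PL probability of a correct response (same formula as above)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# At theta = b, P = c + (1 - c)/2, no matter the value of a
p3pl(theta = 1.2, a = 0.9, b = 1.2, c = 0.2)  # 0.6
p3pl(theta = 1.2, a = 2.5, b = 1.2, c = 0.2)  # still 0.6
```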

Note that for Form X, we simulate response patterns using the old item parameters. We’ll then estimate abilities with both the old and the new (scaled) item parameters so we can compare the results.

set.seed(345) # for rbinom

generate_responses <- function(params, theta) {
  a <- params$discrimination
  b <- params$difficulty
  c <- params$asymptote
  n_items <- length(a)
  responses <- matrix(0, nrow = length(theta), ncol = n_items)
  
  for (i in 1:length(theta)) {
    for (j in 1:n_items) {
      prob <- c[j] + (1 - c[j]) / (1 + exp(-a[j] * (theta[i] - b[j])))
      responses[i, j] <- rbinom(1, 1, prob)
    }
  }
  return(responses)
}

# Generate responses for both forms
responses.x <- generate_responses(KB04$pars$form.x, theta.x)
responses.y <- generate_responses(KB04$pars$form.y, theta.y)

# Descriptive stats of sum scores
describe(rowSums(responses.x)) %>% 
  bind_rows(describe(rowSums(responses.y))) %>%
  as_tibble() %>%
  mutate(group = c("x","y")) %>%
  relocate(group)
## # A tibble: 2 × 14
##   group  vars     n  mean    sd median trimmed   mad   min   max range  skew
##   <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x         1  1000  16.7  5.38     16    16.5  4.45     3    33    30 0.399
## 2 y         1  1000  18.9  5.45     19    18.9  5.93     5    35    30 0.122
## # ℹ 2 more variables: kurtosis <dbl>, se <dbl>

This table presents descriptive statistics of sum scores. Since we’ll use MLE with the IRT item parameters to estimate abilities, take these sum-score statistics with a grain of salt.

Estimate Abilities

Now let’s estimate the abilities, standard error, and percentile rank for our test-takers.

x.ability.old <- est_abilities_parallel(responses.x
                                        , as = KB04$pars$form.x$discrimination
                                        , bs = KB04$pars$form.x$difficulty
                                        , cs = KB04$pars$form.x$asymptote
                                        ) %>%
  mutate(pct_rank = percent_rank(ability_est))

x.ability.new <- est_abilities_parallel(responses.x
                                        , as = kb04.pars.out$group1[,1]
                                        , bs = kb04.pars.out$group1[,2]
                                        , cs = kb04.pars.out$group1[,3]
                                        ) %>%
  mutate(pct_rank = percent_rank(ability_est))

y.ability <- est_abilities_parallel(responses.y
                                    , as = kb04.pars.out$group2[,1]
                                    , bs = kb04.pars.out$group2[,2]
                                    , cs = kb04.pars.out$group2[,3]
                                    ) %>%
  mutate(pct_rank = percent_rank(ability_est))

# Combine into single DF
kb04.abilities <- x.ability.old %>%
  mutate(group = "x.old", 
         ability_true = theta.x) %>%
  bind_rows(x.ability.new %>% 
              mutate(group = "x.new", 
         ability_true = theta.x)) %>%
  bind_rows(y.ability %>%
              mutate(group = "y", 
         ability_true = theta.y)) %>%
  relocate(ability_true) %>%
  mutate(group = factor(group, levels = c("x.old","x.new","y")))

lapply(list(kb04.abilities$ability_est[kb04.abilities$group=="x.old"],
            kb04.abilities$ability_est[kb04.abilities$group=="x.new"],
            kb04.abilities$ability_est[kb04.abilities$group=="y"]), 
       describe) %>% 
  bind_rows %>%
  as.data.frame() %>%
  mutate(group = c("x.old","x.new","y")) %>%
  as_tibble %>%
  relocate(group, n, mean, sd, min, max, range, median)
## # A tibble: 3 × 14
##   group     n    mean    sd    min   max range  median  vars trimmed   mad
##   <chr> <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 x.old  1000 -0.109   1.52 -10.5   3.44  13.9  0.0543     1  0.0132  1.08
## 2 x.new  1000 -0.626   1.75 -12.1   3.55  15.7 -0.440      1 -0.489   1.27
## 3 y      1000 -0.0491  1.21  -5.78  4.71  10.5  0.0439     1 -0.0145  1.12
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>

This table shows descriptive statistics for Groups X and Y:

  • x.old are the simulated Form X response data, with the original Form X item parameters
  • x.new are the simulated Form X response data, with the new (scaled) Form X item parameters
  • y are the simulated Form Y response data, with the Form Y item parameters

As you can see, the x.old and y groups have similar score distributions, which makes sense: both sets of true abilities were simulated from a normal distribution, so we would expect the estimated ability distributions to be similar as well. Once we rescale the item parameters for Form X (x.new) and re-estimate abilities from the same response patterns as before, the distribution shifts:

  • It’s more spread out, which checks out, since the Scaling Summary section showed the SD of the b-parameters for Form X went from 1.099855 to 1.293511, and
  • It’s shifted to the left, which also makes sense, since the mean of the b-parameters for Form X went from 0.810591 to 0.449127.
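The leftward shift is easy to sanity-check with a minimal sketch. Using a 1PL model and made-up item difficulties (not our 3PL parameters), and considering only the roughly 0.36-point drop in the mean of the Form X b-parameters (ignoring the change in spread), re-estimating ability from the same response pattern after shifting every difficulty down moves the MLE down by the same amount:

```r
# Minimal 1PL sketch with hypothetical numbers: shifting item difficulties
# down shifts the ability MLE down by the same amount for a fixed response pattern.
mle_theta <- function(resp, b) {
  # Rasch log-likelihood: sum over items of resp*(theta - b) - log(1 + exp(theta - b))
  optimize(function(th) sum(resp * (th - b) - log(1 + exp(th - b))),
           interval = c(-6, 6), maximum = TRUE)$maximum
}

resp <- c(1, 1, 0, 1, 0, 0, 1, 0)                     # one hypothetical response pattern
b    <- c(-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)  # hypothetical difficulties

theta_old <- mle_theta(resp, b)
theta_new <- mle_theta(resp, b - 0.36)  # difficulties shifted left, as in our rescaling
c(old = theta_old, new = theta_new)     # theta_new is about 0.36 lower
```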

Visualizations

Now let's visualize our testing data, starting with score distributions.

Distributions

Code to create plots…

raw.est.plot <- data.frame(raw = rowSums(responses.x),
           mle = kb04.abilities$ability_est[kb04.abilities$group=="x.old"],
           group = "x.old") %>%
    bind_rows(
    data.frame(raw = rowSums(responses.x),
               mle = kb04.abilities$ability_est[kb04.abilities$group=="x.new"],
               group = "x.new")
  ) %>%
  bind_rows(
    data.frame(raw = rowSums(responses.y),
               mle = kb04.abilities$ability_est[kb04.abilities$group=="y"],
               group = "y")
  ) %>%
  ggplot(., aes(x = raw, y = mle, color = group)) +
  facet_wrap(~group) +
  geom_point(alpha = 0.3, size = 1) +
  labs(title = "Raw Scores v. Scaled MLE Estimated Ability by Group",
       x = "Raw Score",
       y = "Scaled Ability Estimate",
       color = "Group") +
  theme_clean() +
  theme(legend.position.inside = c(0.95, 0.05),
        legend.position = "inside",
        legend.justification = c("right", "bottom"))  # Adjust justification

ability.plot <- kb04.abilities %>%
  filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, color = as.factor(group), fill = as.factor(group))) +
  geom_density(alpha = 0.3) +  # Overlayed density plots with some transparency
  labs(title = "Overlayed Density Plot of Ability Estimates by Group",
       x = "Theta",
       y = "Density",
       color = "Group",
       fill = "Group") +
  scale_x_continuous(limits = c(-5, 5)) +
  scale_y_continuous(limits = c(0, 0.6)) +
  theme_clean() +
    theme(legend.position.inside = c(0.05, 0.95),
    legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification
  

ability.pct.plot <- kb04.abilities %>%
  ggplot(., aes(x = ability_est, y = pct_rank, color = group)) +
  geom_line() +
  scale_x_continuous(limits = c(-5,5)) + 
  labs(title = "Ability and Percentile Rank by Group",
       x = "Theta",
       y = "Percentile Rank",
       color = "Group") +
  theme_clean() +
  theme(legend.position.inside = c(0.05, 0.95),
    legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification
  

se.plot <- kb04.abilities %>%
  filter(abs(ability_est) <= 6.0) %>%
  ggplot(., aes(x = ability_est, y = ability_est_se, color = group)) +
  geom_point(alpha = 0.3, size = 1) +
  scale_x_continuous(limits = c(-5, 5)) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(title = "SE of Ability Estimate by Group",
     x = "Theta",
     y = "Standard Error (SE)",
     color = "Group",
     fill = "Group") +
  theme_clean() +
  theme(legend.position.inside = c(0.05, 0.95),
    legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification
Raw and Scaled MLE
print(raw.est.plot)

As we can see, the relationship between Raw Score and Scaled Ability Estimate is similar across all three groups.

Ability Density
suppressWarnings(print(ability.plot))

Looking at the ability density, we can see that x.old and y have similar distributions, but after vertical scaling, the new Group X scores (x.new) are more spread out and shifted to the left.

Percentile Rank
suppressWarnings(print(ability.pct.plot))

Standard Error
suppressWarnings(print(se.plot))

Density, Info, and SE

Code to create plots…

# Create test information dataframe for plotting
kb04.ti <- test_info(thetas = seq(-4,4,.1),
          a = KB04$pars$form.x$discrimination, # Original X
          b = KB04$pars$form.x$difficulty,
          c = KB04$pars$form.x$asymptote
          ) %>%
  rename(x.old.info = "info", x.old.se = "se") %>%
  bind_cols(
    test_info(thetas = seq(-4,4,.1),
          a = kb04.pars.out$group1[,1], # Scaled X
          b = kb04.pars.out$group1[,2],
          c = kb04.pars.out$group1[,3]
          ) %>%
      rename(x.new.info = "info", x.new.se = "se") %>%
      select(-theta)
  ) %>% 
  bind_cols(
    test_info(thetas = seq(-4,4,.1),
          a = kb04.pars.out$group2[,1], # Y
          b = kb04.pars.out$group2[,2],
          c = kb04.pars.out$group2[,3]
          ) %>%
      rename(y.info = "info", y.se = "se") %>%
      select(-theta)
  )

# Create the combined plot
combined_plot.x.old <- kb04.abilities %>%
  filter(group == "x.old") %>%
  ggplot(aes(x = ability_est)) +
  # Density plot
  geom_density(alpha = 0.1, fill = "red", color = "red") +
  # Test information function
  geom_line(data = kb04.ti, aes(x = theta, 
                           y = x.old.info / max(x.old.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.old"])$y)), 
            color = "purple", linewidth = 0.5) +
  # Standard error
  geom_line(data = kb04.ti, aes(x = theta, 
                           y = x.old.se / max(x.old.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.old"])$y)), 
            color = "orange3", linewidth = 0.5) +
  # Labels and theme
  labs(title = "Original Form X Ability Estimate Density Plot with Test Info and SE",
       x = "Theta",
       y = "Density of Ability Estimates") +
  theme_clean() +
  scale_x_continuous(limits = c(-6,6)) +
  # Add a secondary y-axis for Test Information and SE
  scale_y_continuous(
    limits = c(0, 0.6),
    sec.axis = sec_axis(~ . * 10, 
                        name = "Test Information / Standard Error")
  ) +
  theme(plot.title = element_text(size = 12))  # Adjust title size


# Create the combined plot
combined_plot.x.new <- kb04.abilities %>%
  filter(group == "x.new") %>%
  ggplot(aes(x = ability_est)) +
  # Density plot
  geom_density(alpha = 0.1, fill = "green", color = "green") +
  # Test information function
  geom_line(data = kb04.ti, aes(x = theta, 
                           y = x.new.info / max(x.new.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.new"])$y)), 
            color = "purple", linewidth = 0.5) +
  # Standard error
  geom_line(data = kb04.ti, aes(x = theta, 
                           y = x.new.se / max(x.new.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.new"])$y)), 
            color = "orange3", linewidth = 0.5) +
  # Labels and theme
  labs(title = "Scaled Form X Ability Estimate Density Plot with Test Info and SE",
       x = "Theta",
       y = "Density of Ability Estimates") +
  theme_clean() +
  scale_x_continuous(limits = c(-6,6)) +
  # Add a secondary y-axis for Test Information and SE
  scale_y_continuous(
    limits = c(0, 0.6),
    sec.axis = sec_axis(~ . * 10, 
                        name = "Test Information / Standard Error")
  ) +
  theme(plot.title = element_text(size = 12))  # Adjust title size

combined_plot.y <- kb04.abilities %>%
  filter(group == "y") %>%
  ggplot(aes(x = ability_est)) +
  # Density plot
  geom_density(alpha = 0.1, fill = "blue", color = "blue") +
  # Test information function
  geom_line(data = kb04.ti, aes(x = theta, 
                           y = y.info / max(y.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "y"])$y)), 
            color = "purple", linewidth = 0.5) +
  # Standard error
  geom_line(data = kb04.ti, aes(x = theta, y = 
                             y.se / max(y.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "y"])$y)), 
            color = "orange3", linewidth = 0.5) +
  # Labels and theme
  labs(title = "Form Y Ability Estimate Density Plot with Test Info and SE",
       x = "Theta",
       y = "Density of Ability Estimates") +
  theme_clean() +
  scale_x_continuous(limits = c(-6,6)) +
  # Add a secondary y-axis for Test Information and SE
  scale_y_continuous(
    limits = c(0, 0.6),
    sec.axis = sec_axis(~ . * 10, 
                        name = "Test Information / Standard Error")
  ) +
  theme(plot.title = element_text(size = 12))  # Adjust title size
x.old
suppressWarnings(print(combined_plot.x.old))

Note: the purple line represents Test Information, and the orange line represents the Test Standard Error.

x.new
suppressWarnings(print(combined_plot.x.new))

Note: the purple line represents Test Information, and the orange line represents the Test Standard Error.

y
suppressWarnings(print(combined_plot.y))

Note: the purple line represents Test Information, and the orange line represents the Test Standard Error.

Score Reporting

Now that we’ve performed vertical scaling for our sample and estimated abilities for our test-takers, we should convert the ability estimates (in logits) to a reporting scale. This process is called scaling, and it makes scores easier for our audience to interpret and track.

Let’s look at the descriptive statistics of abilities of our two groups.

psych::describe(x.ability.new$ability_est) %>% select(n, mean, sd, median, min, max, range) %>% 
  bind_rows(
    psych::describe(y.ability$ability_est) %>% select(n, mean, sd, median, min, max, range)
    ) %>% as_tibble %>%
  mutate(group = c("x.new","y")) %>% relocate(group)
## # A tibble: 2 × 8
##   group     n    mean    sd  median    min   max range
##   <chr> <dbl>   <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>
## 1 x.new  1000 -0.626   1.75 -0.440  -12.1   3.55  15.7
## 2 y      1000 -0.0491  1.21  0.0439  -5.78  4.71  10.5

We have quite the range here, with some extreme values for the x.new group (min = -12.11), and even for the y group (min = -5.78). There are both linear and non-linear methods we could employ to convert these ability estimates to another scale. Let’s look at linear methods first.

Scaling Option 1: Linear based on Limits

Let’s assume we want our transformed scale to have the following limits:

  • Min = 200
  • Max = 800

In this scaling option, we use the min and max values from our entire dataset (across all forms) and draw a line between the points (-12.11, 200) and (4.71, 800); every score in between is linearly transformed onto that scale. Let’s do that:

The formula for this is: \[S = A \cdot \theta + B\] Where:

  • \(S\) is the Scaled score
  • \({\theta}\) is the latent ability estimate
  • \(A\) is the scaling factor
  • \(B\) is the intercept

And the formula for computing A and B: \[A = \frac{S_{\text{max}} - S_{\text{min}}}{\theta_{\text{max}} - \theta_{\text{min}}}\] \[B = S_{\text{min}} - A \cdot \theta_{\text{min}}\]

Let’s compute that and create a scaled.scores dataframe.

scaled.scores <- kb04.abilities %>%
  filter(group %in% c("x.new","y"))

A <- (800 - 200) / (max(scaled.scores$ability_est) - min(scaled.scores$ability_est))
B <- 200 - A * (min(scaled.scores$ability_est))

scaled.scores <- scaled.scores %>%
  mutate(scale_linear = ceiling(A * ability_est + B))

# Plot
scaled.scores %>%
  ggplot(., aes(x = ability_est, y = scale_linear)) +
  geom_point(alpha = 0.2, color = "blue3") +
  labs(title = "Linear Scaling using Scale Limits",
       x = "Theta",
       y = "Scaled Score") +
  theme_clean()

This plot presents the linear scaling results for all sims (Groups X and Y). It also highlights the outliers at the low end of the theta scale: values below -5 span scaled scores from about 200 to 450. In that sense, this method wastes a large portion of our scale on a few extreme outliers, while the scaled scores in the middle of the distribution are squeezed into a relatively narrow band. Let’s break up our sample into quintiles and look at the scale range for each group.

scale_breakdown <- function(df, varb) {
  scaling <- function(df, varb) {
    df %>%
      summarise(
        scale_min = min({{varb}}, na.rm = TRUE),
        scale_max = max({{varb}}, na.rm = TRUE),
        scale_range = scale_max - scale_min
      )
  }
  
  result <- bind_rows(
    df %>% scaling({{varb}}),
    df %>% filter(pct_rank < 0.20) %>% scaling({{varb}}),
    df %>% filter(pct_rank >= 0.20 & pct_rank < 0.40) %>% scaling({{varb}}),
    df %>% filter(pct_rank >= 0.40 & pct_rank < 0.60) %>% scaling({{varb}}),
    df %>% filter(pct_rank >= 0.60 & pct_rank < 0.80) %>% scaling({{varb}}),
    df %>% filter(pct_rank >= 0.80) %>% scaling({{varb}})
  ) %>%
  mutate(group = c("Total", "0%-20%", "20%-40%", "40%-60%", "60%-80%", "80%-100%")) %>%
  relocate(group)
  
  return(result)
}

scaled.scores %>% scale_breakdown(scale_linear)
##      group scale_min scale_max scale_range
## 1    Total       200       800         600
## 2   0%-20%       200       598         398
## 3  20%-40%       575       623          48
## 4  40%-60%       605       643          38
## 5  60%-80%       627       664          37
## 6 80%-100%       655       800         145

This shows that the scale range for the bottom 20% of our sample spans 398 scaled points (nearly 2/3 of our range!). Not only is that a lot of dedicated scale range for a mere fifth of our sample, but examining our Info and SE plots, we have more error than information below about -1.5 logits, meaning we don’t have much score precision for MLE estimates in that range anyway. Conversely, the middle fifth of our sims (40% - 60%) is separated by a mere 38 points. That’s not much separation for a group of test-takers for which we do have a lot of measurement precision.

Let’s consider another linear approach that provides better separation in the middle of the distribution.

Scaling Option 2: Linear: Piecewise with Limits

One thing we could do to address this is create a piecewise linear scale, where we select min and max Theta values to derive this scale, and any scores below or above (respectively) those thresholds are given the low or high score. For instance, our scale could be:

  • If \(\theta \le -3.0\), then \(S = 200\)
  • If \(-3.0 < \theta < 3.0\), then \(S = f(\theta)\)
  • If \(\theta \ge 3.0\), then \(S = 800\)

Here’s how that would look if we set our min(theta) and max(theta) to -4 and 4, respectively.

A.piece <- (800 - 200) / (4 - (-4))
B.piece <- 200 - A.piece * (-4)

scaled.scores <- scaled.scores %>%
  mutate(scale_linear_piece = ceiling(A.piece * ability_est + B.piece),
         scale_linear_piece = ifelse(scale_linear_piece < 200, 200, 
                                     ifelse(scale_linear_piece > 800, 800, scale_linear_piece)))

scaled.scores %>%
  ggplot(aes(x = ability_est)) +
  geom_point(aes(y = scale_linear, color = "Limits"), alpha = 0.2) +
  geom_point(aes(y = scale_linear_piece, color = "Limits, Piecewise"), alpha = 0.2) +
  scale_color_manual(
    name = "Method",
    values = c("Limits" = "blue3", "Limits, Piecewise" = "green3"),
    labels = c("Limits", "Limits, Piecewise")
  ) +
  labs(
    title = "Linear Scaling Results",
    x = "Theta",
    y = "Scaled Score"
  ) +
  theme_clean() +
  theme(legend.position.inside = c(0.05, 0.95),
        legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification

scaled.scores %>% scale_breakdown(scale_linear_piece)
##      group scale_min scale_max scale_range
## 1    Total       200       800         600
## 2   0%-20%       200       428         228
## 3  20%-40%       379       481         102
## 4  40%-60%       442       522          80
## 5  60%-80%       488       567          79
## 6 80%-100%       548       800         252

That looks a lot more reasonable, with the middle quintiles each spanning around 80 to 100 points. We could take this a step further and set the theta limits to something even narrower (e.g., c(-3, 3)), but we’ll leave it be for now. Let’s try one more linear scaling option.

Scaling Option 3: Linear based on a Distribution

Another option is a transformation based on the mean and SD of our normative (base sample) scores. Let’s assume we want our transformed scale to have the following features:

  • Mean = 500
  • SD = 100

The formula for this is: \[S = \frac{\sigma(S)}{\sigma(\theta)} \theta + \left[\mu(S) - \frac{\sigma(S)}{\sigma(\theta)} \mu(\theta)\right]\] Where:

  • \(S\) : Scaled score
  • \(\theta\) : Latent ability estimate
  • \(\mu(S)\) : Desired mean of the scaled scores
  • \(\sigma(S)\) : Desired standard deviation of the scaled scores
  • \(\mu(\theta)\) : Mean of the latent ability estimates
  • \(\sigma(\theta)\) : Standard deviation of the latent ability estimates

So let’s implement this and take a look at how it compares to our scaling based on scale limits.

sd.S <- 100
sd.Theta <- sd(scaled.scores$ability_est)
mean.S <- 500 
mean.Theta <- mean(scaled.scores$ability_est)

scaled.scores <- scaled.scores %>%
  mutate(scale_linear_MS = (sd.S / sd.Theta) * ability_est + (mean.S - (sd.S / sd.Theta) * mean.Theta))

scaled.scores %>%
  ggplot(aes(x = ability_est)) +
   geom_point(aes(y = scale_linear, color = "Limits"), alpha = 0.2) +
  geom_point(aes(y = scale_linear_piece, color = "Limits, Piecewise"), alpha = 0.2) +
  geom_point(aes(y = scale_linear_MS, color = "Distribution"), alpha = 0.2) +
  scale_color_manual(
    name = "Method",
    values = c("Limits" = "blue3", "Limits, Piecewise" = "green3", "Distribution" = "orange3"),
    labels = c("Distribution", "Limits", "Limits, Piecewise")
  ) +
  labs(
    title = "Linear Scaling Results",
    x = "Theta",
    y = "Scaled Score"
  ) +
  theme_clean() +
  theme(legend.position.inside = c(0.95, 0.05), 
        legend.position = "inside",
        legend.justification = c("right", "bottom"))  # Adjust justification

psych::describe(scaled.scores$scale_linear_MS)
##    vars    n mean  sd median trimmed mad  min max range  skew kurtosis   se
## X1    1 2000  500 100    508     506  80 -269 830  1099 -1.78     9.33 2.24

As we can see, while this new scale spreads out the scaled scores more, it also extends well below the 200 point scale limit we identified earlier. Perhaps we’re comfortable offering scores of -269 to test-takers, but probably not. Let’s take a look at our quintiles.

scaled.scores %>% scale_breakdown(scale_linear_MS)
##      group scale_min scale_max scale_range
## 1    Total      -269       830      1099.2
## 2   0%-20%      -269       458       727.9
## 3  20%-40%       416       505        88.8
## 4  40%-60%       471       541        69.7
## 5  60%-80%       512       580        68.5
## 6 80%-100%       564       830       266.1

Again, the range for the bottom 20% is enormous (728), and that of our middle group is still comparatively narrow.

The linear piecewise scaling approach seems to give us the best spread for the middle quintiles, though the downside is that we artificially remove variability beyond the bounds we set (e.g., everyone below -4 logits receives the same scaled score).

Now let’s consider one non-linear transformation and see how it compares to the linear scaling.

Scaling Option 4: Non-Linear Logistic

There are many different non-linear approaches we could take, but let’s take a logistic approach. Here’s one such transformation we could use:

\[S = \frac{S_{\text{max}} - S_{\text{min}}}{1 + e^{-a(\theta - \theta_0)}} + S_{\text{min}}\]

Where \(\theta_0\) is the midpoint of the latent ability scale, and \(a\) is a parameter that controls the steepness of the logistic curve. We’ll set \(a = 1\), so it effectively drops out of the calculation.

# Median of the theta
mid.Theta <- median(scaled.scores$ability_est)

# Add a scaled score based on logistic 
scaled.scores <- scaled.scores %>%
  mutate(scale_nl_logistic = (800 - 200)/(1 + exp(-1 *(ability_est - mid.Theta))) + 200)

scaled.scores %>%
  ggplot(aes(x = ability_est)) +
   geom_point(aes(y = scale_linear, color = "L: Limits"), alpha = 0.2) +
  geom_point(aes(y = scale_linear_piece, color = "L: Limits, Piecewise"), alpha = 0.2) +
  geom_point(aes(y = scale_linear_MS, color = "L: Distribution"), alpha = 0.2) +
  geom_point(aes(y = scale_nl_logistic, color = "NL: Logistic"), alpha = 0.2) +
  scale_color_manual(
    name = "Method",
    values = c("L: Limits" = "blue3", "L: Limits, Piecewise" = "green3", "L: Distribution" = "orange3",
               "NL: Logistic" = "red3"),
    labels = c("L: Distribution", "L: Limits", "L: Limits, Piecewise",
               "NL: Logistic")
  ) +
  labs(
    title = "Scaling Results",
    x = "Theta",
    y = "Scaled Score"
  ) +
  theme_clean() +
    theme(legend.position.inside = c(0.95, 0.05),
        legend.position = "inside",
        legend.justification = c("right", "bottom"))  # Adjust justification

scaled.scores %>% scale_breakdown(scale_nl_logistic)
##      group scale_min scale_max scale_range
## 1    Total       200       796         596
## 2   0%-20%       200       391         191
## 3  20%-40%       318       493         175
## 4  40%-60%       417       573         156
## 5  60%-80%       507       650         143
## 6 80%-100%       620       796         176

That gives us a much more even distribution of scale ranges for each of our quintiles. However, we should note some downsides to using non-linear transformations of scaled scores, especially when it comes to tracking progress over time.

  1. Loss of Interpretability
    • Non-linear transformations can make the interpretation of scores more complex. For example, with a linear scale, a unit increase in the score has a consistent meaning across the entire range of the scale. In contrast, with a non-linear scale, the meaning of a unit increase can vary depending on where it occurs on the scale. This variability can make it difficult for stakeholders to understand and interpret changes in scores over time.
  2. Inconsistency in Growth Measurement
    • Non-linear transformations can distort the measurement of growth or progress. For instance, in educational assessments, we often want to measure how much a student’s ability has increased from one point in time to another. If the scale is non-linear, the amount of progress required to move from one score to another can vary, making it difficult to compare growth consistently across different parts of the scale.
  3. Difficulty in Setting and Comparing Benchmarks
    • Setting benchmarks or cut scores (e.g., for proficiency levels) becomes more challenging with non-linear scales. Benchmarks set on a non-linear scale may not correspond to intuitive or easily understood levels of ability or performance. Additionally, comparing benchmarks across different groups or time periods can be problematic if the non-linear transformation is not consistent across the entire scale.
  4. Complexity in Communication
    • Communicating results to stakeholders, such as educators, students, parents, or policymakers, can become more difficult with non-linear scales. Stakeholders may have a harder time understanding what the scores mean and how they relate to each other, which can undermine confidence in the assessment system.
  5. Issues with Longitudinal Comparability
    • Longitudinal studies that track progress over multiple years require consistent and comparable scales. Non-linear transformations can introduce discrepancies that complicate the analysis of long-term trends. This inconsistency can lead to misinterpretations of data and flawed conclusions about progress or effectiveness of interventions.
  6. Potential for Misleading Inferences
    • Non-linear scales can sometimes lead to misleading inferences about relative performance. For example, two students with the same score difference on a non-linear scale might have very different actual performance differences depending on where they fall on the scale. This can affect decisions based on these scores, such as resource allocation or instructional adjustments.
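Drawback 2 (inconsistent growth measurement) is easy to demonstrate numerically. Using our logistic transform with hypothetical theta values (and a midpoint of 0 for simplicity), the same half-logit of growth is worth several times more scaled-score points in the middle of the scale than in the upper tail:

```r
# Logistic scaled-score transform (a = 1, midpoint assumed at 0 for illustration)
logistic_scale <- function(theta, s_min = 200, s_max = 800, mid = 0) {
  (s_max - s_min) / (1 + exp(-(theta - mid))) + s_min
}

# The same 0.5-logit gain, starting from two different points on the scale:
gain_mid  <- logistic_scale(0.5) - logistic_scale(0.0)  # near the middle, ~73 points
gain_high <- logistic_scale(3.0) - logistic_scale(2.5)  # in the upper tail, ~17 points
```

Two students can make identical latent growth yet appear to grow at very different rates on the reported scale, purely as an artifact of the transformation.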

Given the drawbacks of using a non-linear scaling method, the piecewise method is probably the most appropriate for our situation.

Now that we’ve seen Vertical Scaling in action with 2 groups, let’s move into the next example in which we use Vertical Scaling across 6 grades and simulate longitudinal data for each timepoint.

Example 2: Six Groups

The Dataset

TK07 / Tong - Kolen (2007): This dataset includes unidimensional three-parameter logistic model (3PL) item parameter estimates from a simulated dataset based on items from the 1992 Iowa Test of Basic Skills vocabulary test. The data are for six groups (grades 3-8) with varying numbers of common items between adjacent grades. We can load the TK07 dataset from plink.

Let’s load the dataset, and see how many items we’re working with.

TK07 <- TK07
sapply(TK07$pars, nrow)
## grade3 grade4 grade5 grade6 grade7 grade8 
##     26     34     37     40     41     43
sapply(TK07$common, nrow)
## g3.g4 g4.g5 g5.g6 g6.g7 g7.g8 
##    13    21    16    24    17

So we’ve got one form for each of 6 grades (g3-g8), with adjacent forms having common items. For instance, this output shows there are 26 items in g3 and 34 items in g4, and those two grades have 13 items in common.

Vertical Scaling

Let’s perform vertical scaling using this data. Here are the settings we’ll use for this example:

  1. We’ll rescale using the Mean/Sigma method, "MS".
  2. The base group this time will be group 1 (Grade 3).
  3. We’ll use the scaling constant of 1.7, since these items were also calibrated using BILOG-MG.
  4. We won’t throw out any common items this time.
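Before handing everything to plink, here’s a minimal sketch of what the Mean/Sigma method computes for a single grade pair, using hypothetical b-parameters for five common items (plink handles the full linking chain, the a- and c-parameters, and the D constant for us):

```r
# Hypothetical difficulties for five common items, as calibrated separately
# on the base form and on the next grade's form.
b_base <- c(-1.2, -0.4, 0.1, 0.8, 1.5)
b_new  <- c(-0.9, -0.1, 0.5, 1.1, 1.9)

# Mean/Sigma linking constants: match the mean and SD of the common-item
# difficulties across the two scales.
A <- sd(b_base) / sd(b_new)
B <- mean(b_base) - A * mean(b_new)

# Placing the new form on the base scale:
#   b* = A * b + B,   a* = a / A,   theta* = A * theta + B
b_rescaled <- A * b_new + B
```

By construction, the rescaled common-item difficulties have the same mean and SD as their base-form counterparts; applying this pairwise, grade by grade, is what forms the linking chain described earlier.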

# Get the number of items in each grade's form
tk07.n.items <- list(nrow(TK07$pars$grade3),
                     nrow(TK07$pars$grade4),
                     nrow(TK07$pars$grade5),
                     nrow(TK07$pars$grade6),
                     nrow(TK07$pars$grade7),
                     nrow(TK07$pars$grade8))

# Create irt.pars object with six groups (all dichotomous items)
tk07.pm <- lapply(tk07.n.items, as.poly.mod)

# Number of response categories for each item in each dataset (all dichotomous)
tk07.cat <- lapply(tk07.n.items, function(x) rep(2, x))

tk07.irt.pars <- as.irt.pars(# Turn values into an `irt.pars` object
  x = TK07$pars,             # An object of class `irt.pars` with multiple groups
  common = TK07$common,      # Matrix of common items between the groups
  cat = tk07.cat,            # Number of response categories (2) for each item in the dataset
  poly.mod = tk07.pm,        # Lists the poly.mod objects that defines the models for each item
  grp.names = paste0("grade",3:8)
                             )

# rescale the item parameters using the Mean/Sigma linking constants,
tk07.out <- plink(tk07.irt.pars, 
             rescale = "MS", # Transform parameters to the base scale, using Mean/Sigma (Mean/SD) method
             base.grp = 1,   # The base scale is Group 1 (default; Grade 3)
             D = 1.7)        # Scaling constant to transform the logistic model; D=1.7 ~~ normal ogive model

tk07.pars.out <- link.pars(tk07.out)

We won’t examine the scaling results this time, since we’ll have 10 tables describing the linear scaling results for all adjacent grade pairs.

Simulate Data

Now, let’s simulate data as if we are tracking the same group of 1000 students longitudinally.

True (Latent) Abilities

To simulate longitudinal data, we’ll start by simulating g3 abilities, apply a base growth rate, then mean-center the abilities for each grade. Since we’re working with 6 groups, we’ll put these abilities in a dataframe tk07.true.abilities to help us keep track of them.

# 1. Simulate ability levels
n_test_takers <- 1000
set.seed(012)

# Simulate initial abilities
tk07_initial_abilities <- rnorm(n_test_takers)

# Select a growth rate, and adjust for variability by starting ability (lower ability students grow a little less rapidly, and higher ability students grow a little more rapidly)
base_growth_rate <- 0.3  # Average growth per time period

# Create abilities dataframe
tk07.true.abilities <- data.frame(g3 = tk07_initial_abilities) %>%
  # Simulate growth at each subsequent grade
  mutate( 
    # Simulate growth; multiplying by 1.05 (i.e., gX + 0.05 * gX) means higher-ability students make slightly greater gains.
    g4 = base_growth_rate + g3 * 1.05,
    g5 = base_growth_rate + g4 * 1.05,
    g6 = base_growth_rate + g5 * 1.05,
    g7 = base_growth_rate + g6 * 1.05,
    g8 = base_growth_rate + g7 * 1.05,
    # Then re-center those scores about zero, since we're simulating response patterns for separate tests, with means of about 0 anyway. When we simulate response patterns and apply scaling, the growth will be evident.
    g4 = g4 - mean(g4), 
    g5 = g5 - mean(g5),
    g6 = g6 - mean(g6),
    g7 = g7 - mean(g7),
    g8 = g8 - mean(g8),
    student = 1:n())

# And visualize the growth 
tk07.true.abilities %>%
  pivot_longer(cols = c(g3, g4, g5, g6, g7, g8),
               names_to = "grade",
               values_to = "ability") %>%
  arrange(ability) %>%
  ggplot(., aes(x = grade, y = ability, group = student, color = as.factor(student))) +
    geom_line(alpha = 0.2) +
    theme_clean() +
    theme(legend.position = "none")

tk07.true.abilities %>%
  select(-student) %>%
  sapply(psych::describe) %>%
  t() 
##    vars n    mean      sd    median  trimmed  mad   min   max  range skew  
## g3 1    1000 -0.0264   0.959 -0.0412 -0.031   0.891 -3.05 3.11 6.15  0.0233
## g4 1    1000 1.06e-16  1.01  -0.0155 -0.00478 0.935 -3.17 3.29 6.46  0.0233
## g5 1    1000 5.8e-17   1.06  -0.0163 -0.00502 0.982 -3.33 3.45 6.78  0.0233
## g6 1    1000 -4.23e-17 1.11  -0.0171 -0.00527 1.03  -3.5  3.63 7.12  0.0233
## g7 1    1000 1.17e-16  1.17  -0.018  -0.00553 1.08  -3.67 3.81 7.48  0.0233
## g8 1    1000 -1.5e-16  1.22  -0.0189 -0.00581 1.14  -3.85 4    7.85  0.0233
##    kurtosis se    
## g3 0.103    0.0303
## g4 0.103    0.0319
## g5 0.103    0.0334
## g6 0.103    0.0351
## g7 0.103    0.0369
## g8 0.103    0.0387

Here are the simulated abilities for our sample of sims over time. As you can see, the growth trajectories are just straight lines (given the constant growth rate), which is fine, since our process for simulating response patterns will introduce some variability. Right now, each timepoint has a mean of about 0, yet the ability variability increases slightly over time.

Response Patterns

Now let’s generate response patterns for each timepoint, using the same method as before, based on the original item parameters.

set.seed(678) # for rbinom

# Generate responses for all forms
tk07.responses <- list(
  "responses.3" = generate_responses(TK07$pars$grade3, tk07.true.abilities$g3),
  "responses.4" = generate_responses(TK07$pars$grade4, tk07.true.abilities$g4),
  "responses.5" = generate_responses(TK07$pars$grade5, tk07.true.abilities$g5),
  "responses.6" = generate_responses(TK07$pars$grade6, tk07.true.abilities$g6),
  "responses.7" = generate_responses(TK07$pars$grade7, tk07.true.abilities$g7),
  "responses.8" = generate_responses(TK07$pars$grade8, tk07.true.abilities$g8)
)

lapply(tk07.responses, function(x) {psych::describe(rowSums(x))}) %>%
  bind_rows %>%
  as_tibble() %>%
  mutate(timepoint = c("g3","g4","g5","g6","g7","g8")) %>%
  relocate(timepoint)
## # A tibble: 6 × 14
##   timepoint  vars     n  mean    sd median trimmed   mad   min   max range
##   <chr>     <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 g3            1  1000  16.8  4.49   17      17.0  4.45     4    26    22
## 2 g4            1  1000  20.9  6.22   21      21.1  7.41     4    34    30
## 3 g5            1  1000  21.9  7.38   22      21.9  8.90     5    37    32
## 4 g6            1  1000  23.3  9.16   24      23.4 11.9      4    40    36
## 5 g7            1  1000  23.6  9.58   24      23.6 11.9      4    41    37
## 6 g8            1  1000  23.9 10.8    23.5    23.8 14.1      1    43    42
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>

Again, take these results with a grain of salt. But one thing of note is that, just like our latent abilities, the SD of raw scores increases over time (as intended).

Estimate Abilities

Now let’s estimate the abilities, standard error, and percentile rank for our test-takers.

tk07.abilities <- list(
    "g3" = est_abilities_parallel(tk07.responses$responses.3 %>% data.frame,
                                as = tk07.pars.out$grade3[,1],
                                bs = tk07.pars.out$grade3[,2],
                                cs = tk07.pars.out$grade3[,3]
                                ) %>% 
    mutate(timepoint = "g3", 
           ability_true = tk07.true.abilities$g3),
  
    "g4" = est_abilities_parallel(tk07.responses$responses.4,
                                as = tk07.pars.out$grade4[,1],
                                bs = tk07.pars.out$grade4[,2],
                                cs = tk07.pars.out$grade4[,3]
                                ) %>% 
    mutate(timepoint = "g4", 
           ability_true = tk07.true.abilities$g4),
  
    "g5" = est_abilities_parallel(tk07.responses$responses.5,
                                as = tk07.pars.out$grade5[,1],
                                bs = tk07.pars.out$grade5[,2],
                                cs = tk07.pars.out$grade5[,3]
                                ) %>% 
    mutate(timepoint = "g5", 
           ability_true = tk07.true.abilities$g5),
  
    "g6" = est_abilities_parallel(tk07.responses$responses.6,
                                as = tk07.pars.out$grade6[,1],
                                bs = tk07.pars.out$grade6[,2],
                                cs = tk07.pars.out$grade6[,3]
                                ) %>% 
    mutate(timepoint = "g6", 
           ability_true = tk07.true.abilities$g6),
  
    "g7" = est_abilities_parallel(tk07.responses$responses.7,
                                as = tk07.pars.out$grade7[,1],
                                bs = tk07.pars.out$grade7[,2],
                                cs = tk07.pars.out$grade7[,3]
                                ) %>% 
    mutate(timepoint = "g7", 
           ability_true = tk07.true.abilities$g7),
  
    "g8" = est_abilities_parallel(tk07.responses$responses.8,
                                as = tk07.pars.out$grade8[,1],
                                bs = tk07.pars.out$grade8[,2],
                                cs = tk07.pars.out$grade8[,3]
                                ) %>% 
    mutate(timepoint = "g8", 
           ability_true = tk07.true.abilities$g8)
) %>% bind_rows

# And the abilities if we used the old parameters
tk07.abilities.old <- list(
    "g3" = est_abilities_parallel(tk07.responses$responses.3 %>% data.frame,
                                as = TK07$pars$grade3[,1],
                                bs = TK07$pars$grade3[,2],
                                cs = TK07$pars$grade3[,3]
                                ) %>% 
    mutate(timepoint = "g3", 
           ability_true = tk07.true.abilities$g3),
  
    "g4" = est_abilities_parallel(tk07.responses$responses.4,
                                as = TK07$pars$grade4[,1],
                                bs = TK07$pars$grade4[,2],
                                cs = TK07$pars$grade4[,3]
                                ) %>% 
    mutate(timepoint = "g4", 
           ability_true = tk07.true.abilities$g4),
  
    "g5" = est_abilities_parallel(tk07.responses$responses.5,
                                as = TK07$pars$grade5[,1],
                                bs = TK07$pars$grade5[,2],
                                cs = TK07$pars$grade5[,3]
                                ) %>% 
    mutate(timepoint = "g5", 
           ability_true = tk07.true.abilities$g5),
  
    "g6" = est_abilities_parallel(tk07.responses$responses.6,
                                as = TK07$pars$grade6[,1],
                                bs = TK07$pars$grade6[,2],
                                cs = TK07$pars$grade6[,3]
                                ) %>% 
    mutate(timepoint = "g6", 
           ability_true = tk07.true.abilities$g6),
  
    "g7" = est_abilities_parallel(tk07.responses$responses.7,
                                as = TK07$pars$grade7[,1],
                                bs = TK07$pars$grade7[,2],
                                cs = TK07$pars$grade7[,3]
                                ) %>% 
    mutate(timepoint = "g7", 
           ability_true = tk07.true.abilities$g7),
  
    "g8" = est_abilities_parallel(tk07.responses$responses.8,
                                as = TK07$pars$grade8[,1],
                                bs = TK07$pars$grade8[,2],
                                cs = TK07$pars$grade8[,3]
                                ) %>% 
    mutate(timepoint = "g8", 
           ability_true = tk07.true.abilities$g8)
) %>% bind_rows

# Add Percentile Rank, raw score, and percent scores
tk07.abilities <- tk07.abilities %>%
  group_by(timepoint) %>%
  mutate(pct_rank = percent_rank(ability_est)) %>%
  ungroup %>%
  mutate(pct_rank_overall = percent_rank(ability_est),
         student = rep(1:1000,6)) %>%
  left_join(lapply(TK07$pars, nrow) %>%  # Pull out # of items for each form
              unlist %>% 
              data.frame("n_items" = .) %>%
              rownames_to_column("timepoint") %>% 
              mutate(timepoint = gsub("grade","g",timepoint) %>% as.character,
                     n_items = as.numeric(n_items)),
            by = "timepoint") %>%
  mutate(raw_score = unlist(lapply(tk07.responses, rowSums)),
         pct_score = raw_score / n_items) %>%
  select(-n_items)

tk07.abilities.old <- tk07.abilities.old %>%
  group_by(timepoint) %>%
  mutate(pct_rank = percent_rank(ability_est)) %>%
  ungroup %>%
  mutate(pct_rank_overall = percent_rank(ability_est),
         student = rep(1:1000,6)) %>%
  mutate(raw_score = unlist(lapply(tk07.responses, rowSums)))

# Descriptive Stats
tk07.abilities %>% 
  group_by(timepoint) %>% 
  summarise(
    n = n(),
    mean = mean(ability_est),
    sd = sd(ability_est),
    min = min(ability_est),
    max = max(ability_est), 
    range = max - min,
    median = median(ability_est))
## # A tibble: 6 × 8
##   timepoint     n     mean    sd    min   max range median
##   <chr>     <int>    <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 g3         1000 0.000127  1.25 -11.7   3.29  15.0 0.0174
## 2 g4         1000 0.434     1.45 -12.7   4.34  17.0 0.508 
## 3 g5         1000 0.813     1.55  -8.91  4.86  13.8 0.872 
## 4 g6         1000 1.13      1.74  -8.55  5.60  14.1 1.26  
## 5 g7         1000 1.33      1.63  -7.06  5.45  12.5 1.43  
## 6 g8         1000 1.36      2.09  -6.23  5.71  11.9 1.58

These descriptive statistics are now on our vertical scale, and more interpretable. As we can see, the mean (and median) increases over time, as does the SD of scores (for the most part).

Visualizations

Distributions

Code to create distribution plots…

tk07.raw.mle.plot <- tk07.abilities %>%
  ggplot(., aes(x = pct_score, y = ability_est, color = timepoint)) +
  facet_wrap(~timepoint) +
  geom_point(alpha = 0.3, size = 1) +
  labs(title = "Raw Scores v. Scaled MLE Estimated Ability by Timepoint",
       x = "Percent Score",
       y = "Scaled Ability Estimate",
       color = "Timepoint") +
  theme_clean() +
  theme(legend.position = "none")

tk07.ability.plot <- tk07.abilities %>%
  filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, color = as.factor(timepoint), fill = as.factor(timepoint))) +
  geom_density(alpha = 0.1) +  # Overlayed density plots with some transparency
  labs(title = "Overlayed Density Plot of Ability Estimates by Group",
       x = "Theta",
       y = "Density",
       color = "Timepoint",
       fill = "Timepoint") +
  scale_x_continuous(limits = c(-6, 6)) +
  scale_y_continuous(limits = c(0, 0.6)) +
  theme_clean()

tk07.ability.pct.plot <- tk07.abilities %>%
  ggplot(., aes(x = ability_est, y = pct_rank, color = timepoint)) +
  geom_line() +
  scale_x_continuous(limits = c(-5,5)) + 
  labs(title = "Ability and Percentile Rank by Timepoint",
       x = "Theta",
       y = "Percentile Rank",
       color = "Timepoint") +
  theme_clean() +
  theme(legend.position.inside = c(0.05, 0.95),
    legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification

tk07.se.plot <- tk07.abilities %>%
  filter(abs(ability_est) <= 6.0) %>%
  ggplot(., aes(x = ability_est, y = ability_est_se, color = timepoint)) +
  geom_jitter(alpha = 0.2) + 
  geom_point(alpha = 0.2, size = 1) +
  scale_x_continuous(limits = c(-6, 6)) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(title = "SE of Ability Estimate by Group",
     x = "Theta",
     y = "Standard Error (SE)",
     color = "Timepoint",
     fill = "Timepoint") +
  theme_clean()
Raw and Scaled MLE
print(tk07.raw.mle.plot)

As we can see, there is a similar relationship between Score and Scaled Ability Estimate for all groups. Note that each form has a different number of items, so I’ve used pct_score instead of raw_score for comparability.

Ability Density
suppressWarnings(print(tk07.ability.plot))

Based on the ability estimates from vertical scaling, we can see that the ability distributions shift upward (and become more spread out) across grades.

Percentile Rank
suppressWarnings(print(tk07.ability.pct.plot))

Standard Error
suppressWarnings(print(tk07.se.plot))

Density, Info, and SE

Now let’s look at the test information and standard error for each grade.

# List out the names of our grades
grades <- c("g3","g4","g5","g6","g7","g8")

# Create test information dataframe for plotting
for(i in 1:length(tk07.pars.out)) {
  if(i == 1) {
    tk07.ti <- data.frame(thetas = seq(-4,6,.1))
  } 
  temp.ti <- test_info(thetas = seq(-4,6,.1),
          a = tk07.pars.out[[i]][,1],
          b = tk07.pars.out[[i]][,2],
          c = tk07.pars.out[[i]][,3]
          ) %>%
      select(-theta)
  colnames(temp.ti) <- c(paste0(grades[i],".info"),paste0(grades[i],".se")) 
  tk07.ti <- tk07.ti %>%
    bind_cols(temp.ti)
}

# For loop to create plots for every grade
for(i in 1:length(grades)) {
  
  if(i == 1) {
    plots.ls <- list()
  }
  
  # Prep objects & values for ggplot
  abilities <- tk07.abilities %>% filter(timepoint == grades[i])
  abilities.old <- tk07.abilities.old %>% filter(timepoint == grades[i])
  abilities.all <- abilities %>% 
    mutate(Scale = "Scaled") %>% 
    bind_rows(abilities.old %>% 
                mutate(Scale = "Unscaled")
              )
  ti <- tk07.ti %>% select(thetas, starts_with(grades[i]))
  colnames(ti) <- c("theta", "info", "se")
  max.info <- max(ti$info)
  max.se <- max(ti$se)
  max.density <- max(max(density(abilities.all$ability_est[abilities.all$Scale=="Scaled"])$y),
                     density(abilities.all$ability_est[abilities.all$Scale=="Unscaled"])$y)
  
  # Get colors to match the previous plots
  default_colors <- hcl(h = seq(15, 375, 
                                length = length(grades) + 1), 
                        l = 65,
                        c = 100)[1:length(grades)]
  
  # Create plot
  plots.ls[[i]] <- ggplot() +
    # Test information function
    geom_line(data = ti, aes(x = theta, y = info),
              color = "purple", linewidth = 0.5) +
    # Standard error
    geom_line(data = ti, aes(x = theta, y = se),
              color = "orange3", linewidth = 0.5, linetype = "dotted") +
    # Density plot
    geom_density(data = abilities.all, 
                 aes(x = ability_est,
                     y = after_stat(density) * max.info / max.density,
                     fill = Scale,
                     color = Scale),
                 alpha = 0.1) +
    # Set colors for fill and color
    scale_fill_manual(values = c("Scaled" = default_colors[i], "Unscaled" = "grey")) +
    scale_color_manual(values = c("Scaled" = default_colors[i], "Unscaled" = "grey")) +
    labs(title = paste0(toupper(grades[i]), " Ability Estimate Density Plot with Test Info and SE"),
         x = "Theta",
         y = "Test Information / Standard Error") +
    theme_clean() +
    scale_x_continuous(limits = c(-6, 6)) +
    # Set y-axis limits for Test Information and SE
    scale_y_continuous(
      limits = c(0, 20),
      name = "Test Information / Standard Error",
      sec.axis = sec_axis(~ . * max.density / max.info,
                          name = "Density of Ability Estimates")
    ) +
    theme(plot.title = element_text(size = 12),  # Adjust title size
          legend.position.inside = c(0.05, 0.95),
          legend.position = "inside",
          legend.justification = c("left", "top")  # Adjust justification
    )
}
Grade 3

Grade 4

Grade 5

Grade 6

Grade 7

Grade 8

Score Reporting: Piecewise Linear Scaling

Let’s apply the piecewise linear scaling method as before. But first, let’s take a look at the distribution of our ability estimates:

psych::describe(tk07.abilities$ability_est)
##    vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 6000 0.84 1.71    0.9    0.91 1.36 -12.7 5.71  18.4 -1.35      6.6 0.02

Since we have a broader range, let’s select more relaxed limits for our theta scale: c(-5, 5). Here’s a plot of the scaling for all grades.

A.piece <- (800 - 200) / (5 - (-5))
B.piece <- 200 - A.piece * (-5)

tk07.scaled.scores <- tk07.abilities %>%
  mutate(scale_linear_piece = ceiling(A.piece * ability_est + B.piece),
         scale_linear_piece = ifelse(scale_linear_piece < 200, 200, 
                                     ifelse(scale_linear_piece > 800, 800, scale_linear_piece)))
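As a quick sanity check on the transformation (a worked example, not part of the original analysis): with A.piece = 60 and B.piece = 500, a theta of 0 maps to 500 and each theta unit is worth 60 scale points, with values outside [-5, 5] clipped to the scale limits:

# Worked example: apply the same ceiling-and-clip logic to a few thetas
thetas.demo <- c(-6, -5, 0, 1, 5, 6)
raw.demo <- ceiling(60 * thetas.demo + 500)
pmin(pmax(raw.demo, 200), 800)
## [1] 200 200 500 560 800 800

Note that theta = -6 and theta = 6 both land outside the chosen c(-5, 5) range and get censored at 200 and 800, which is exactly what produces the pile-up of minimum and maximum scores examined below.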

tk07.scaled.scores %>%
  ggplot(., aes(x = ability_est, y = scale_linear_piece, group = timepoint, color = timepoint)) +
  geom_jitter(alpha = 0.2) +
  # scale_color_manual(name = "Grade") +
  labs(
    title = "Linear Scaling Results",
    x = "Theta",
    y = "Scaled Score",
    color = "Timepoint",
    group = "Timepoint"
  ) +
  theme_clean() +
  theme(legend.position.inside = c(0.05, 0.95),  # Position in the top left corner
        legend.position = "inside",
        legend.justification = c("left", "top"))  # Adjust justification

It looks like we have more students with extreme values: several received the scale minimum of 200, and a few received the maximum of 800. Let’s see how many min and max scores we have by grade.

tk07.scaled.scores %>%
  group_by(timepoint) %>%
  summarise(n_200_score = sum(scale_linear_piece == 200),
            n_800_score = sum(scale_linear_piece == 800))
## # A tibble: 6 × 3
##   timepoint n_200_score n_800_score
##   <chr>           <int>       <int>
## 1 g3                  3           0
## 2 g4                  5           0
## 3 g5                 10           0
## 4 g6                 18           6
## 5 g7                 13           4
## 6 g8                 36          10

As somewhat expected, we have more 800 scores in the higher grades, but interestingly we also have more minimum scores in the higher grades. This is likely a result of the greater variability at higher grades, and perhaps of the way we simulated response data, rather than of the vertical scaling itself.

Let’s take a look at our scale range by quintile for the piecewise linear approach:

tk07.scaled.scores %>% scale_breakdown(scale_linear_piece)
## # A tibble: 6 × 4
##   group    scale_min scale_max scale_range
##   <chr>        <dbl>     <dbl>       <dbl>
## 1 Total          200       800         600
## 2 0%-20%         200       519         319
## 3 20%-40%        448       575         127
## 4 40%-60%        484       616         132
## 5 60%-80%        518       672         154
## 6 80%-100%       555       800         245

Growth Modeling

Since we have (simulated) longitudinal data, we may also want to perform growth modeling of the data. Here’s an example of what that could look like.

Visualizing Student Scores

First, let’s visualize student growth over time using the scaled scores.

tk07.scaled.scores %>%
  mutate(student = rep(1:1000,6)) %>%
  ggplot(., aes(x = timepoint, y = scale_linear_piece, group = student, color = as.factor(student))) +
  geom_line(alpha = 0.2) +
  labs(
    title = "Scaled Scores Over Time",
    x = "Timepoint",
    y = "Scaled Score"
  ) +
  theme_clean() +
  theme(legend.position = "none")

It seems like we have a trend upward, which is good; we would expect/hope that students improve in ability over time.

Let’s create a linear mixed model of student growth over time. If you’re not familiar with mixed modeling, it basically means that we have both fixed and random effects.

  • Fixed effects are effects that apply to EVERYONE in the model. If you’ve performed simple linear regression before, all parameters for the predictors in the model are fixed effects. In our case, the fixed effects would be an intercept (probably around 500 or so), and a slightly positive fixed effect for timepoint (scores increase over time).

  • Random effects account for individual variability at a certain level of the model. In other words, certain parameters are allowed to vary randomly between cases. In our example, the outcome is Scaled Score, the random effects apply to students, and we could have:

    1. Random Intercept of Scaled Score: every student starts at different intercepts, which deviates from our Intercept Fixed Effect,
    2. Random Slopes of Scaled Score: every student has a different growth trajectory, which deviates from our Timepoint Fixed Effect, or
    3. Both Random Slopes and Random Intercepts
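In equation form (standard multilevel notation, added here for clarity; not from the original notebook), the three candidate models for student i at timepoint t are:

  Model 1 (fixed effects only):        Score_it = b0 + b1 * time_it + e_it
  Model 2 (random intercepts):         Score_it = (b0 + u0_i) + b1 * time_it + e_it
  Model 3 (random slopes & intercepts): Score_it = (b0 + u0_i) + (b1 + u1_i) * time_it + e_it

where b0 and b1 are the fixed intercept and slope, u0_i and u1_i are student-level random deviations from them, and e_it is residual error.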

Modeling

Let’s estimate (1) a simple linear regression, then (2) a model with random intercepts, and (3) a model with random slopes and intercepts, and compare the three.

Specifically, our model will predict Scaled Score by Timepoint (grade).

Note, we’ll change our timepoints so g3 = 0, g4 = 1, and so on.

Model 1: Simple Linear Model

Let’s use the nlme::gls() function to estimate a simple linear regression.

tk07.scaled.scores.modeling <- tk07.scaled.scores %>%
  mutate(timepoint = as.numeric(gsub("g","",timepoint)) - 3) # Set timepoint g3 = 0

tk07.growth.model.1 <- gls(data = tk07.scaled.scores.modeling, 
       scale_linear_piece ~ timepoint,
       method = "ML") 

summary(tk07.growth.model.1)
## Generalized least squares fit by maximum likelihood
##   Model: scale_linear_piece ~ timepoint 
##   Data: tk07.scaled.scores.modeling 
##     AIC   BIC logLik
##   71148 71168 -35571
## 
## Coefficients:
##             Value Std.Error t-value p-value
## (Intercept)   511     2.080   245.4       0
## timepoint      17     0.687    24.4       0
## 
##  Correlation: 
##           (Intr)
## timepoint -0.826
## 
## Standardized residuals:
##     Min      Q1     Med      Q3     Max 
## -4.3415 -0.4974  0.0511  0.5837  2.7272 
## 
## Residual standard error: 90.9 
## Degrees of freedom: 6000 total; 5998 residual

Here’s how to interpret some of the output:

  • The (Intercept) is at timepoint 0 (g3), where the average Scaled Score is 511. This parameter is significantly different from 0 (p-value = 0).
  • The timepoint coefficient is the slope of our predictor “timepoint”. It’s positive (Value = 17) and significantly different from 0, so there is growth: with each increase (or decrease) of 1 grade from the intercept, our Scaled Score is estimated to increase (or decrease) by 17 points.
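As a quick worked example of this interpretation (using the rounded coefficients from the summary above), the fixed-effect predictions at each timepoint are:

# Model 1 predictions: intercept + slope * timepoint, for g3 (0) through g8 (5)
511 + 17 * (0:5)
## [1] 511 528 545 562 579 596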
Model 2: Random intercepts

Next, let’s see if our model improves when we allow intercepts to vary across students.

tk07.growth.model.2 <- lme(fixed = scale_linear_piece ~ timepoint,
                           random = ~ 1 | student,
                           data = tk07.scaled.scores.modeling,
                           method = "ML")

summary(tk07.growth.model.2)
## Linear mixed-effects model fit by maximum likelihood
##   Data: tk07.scaled.scores.modeling 
##     AIC   BIC logLik
##   65702 65729 -32847
## 
## Random effects:
##  Formula: ~1 | student
##         (Intercept) Residual
## StdDev:        78.9     45.1
## 
## Fixed effects:  scale_linear_piece ~ timepoint 
##             Value Std.Error   DF t-value p-value
## (Intercept)   511     2.701 4999   189.0       0
## timepoint      17     0.341 4999    49.3       0
##  Correlation: 
##           (Intr)
## timepoint -0.316
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -6.7203 -0.3515  0.0606  0.4561  3.7962 
## 
## Number of Observations: 6000
## Number of Groups: 1000
anova(tk07.growth.model.1, tk07.growth.model.2)
##                     Model df   AIC   BIC logLik   Test L.Ratio p-value
## tk07.growth.model.1     1  3 71148 71168 -35571                       
## tk07.growth.model.2     2  4 65702 65729 -32847 1 vs 2    5448  <.0001

This output indicates that our Random Intercept (under Random effects:) has a standard deviation of 78.9, while the remaining unexplained (residual) standard deviation is 45.1. In variance terms (squaring the SDs), the random intercept accounts for roughly three-quarters of the total variance.

Furthermore, the last table is an ANOVA comparing our Fixed and Random Intercept models, indicating that the random intercept model is significantly better than our Fixed model.
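One additional way to express this (not shown in the original output) is the intraclass correlation, computed from the squared SDs reported above:

# ICC: between-student variance as a share of total variance (SDs squared)
78.9^2 / (78.9^2 + 45.1^2)
## roughly 0.75: about three-quarters of the variance is between students

This suggests scores from the same student are strongly correlated across timepoints, which is why the random-intercept model fits so much better than the fixed-effects-only model.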

Model 3: Random slopes & intercepts

Now let’s see if our model improves further when we allow both intercepts and slopes to vary across students.

tk07.growth.model.3 <- lme(fixed = scale_linear_piece ~ timepoint,
                           random = ~ timepoint | student,
                           data = tk07.scaled.scores.modeling,
                           method = "ML")

summary(tk07.growth.model.3)
## Linear mixed-effects model fit by maximum likelihood
##   Data: tk07.scaled.scores.modeling 
##     AIC   BIC logLik
##   65086 65127 -32537
## 
## Random effects:
##  Formula: ~timepoint | student
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 58.76  (Intr)
## timepoint    8.92  0.886 
## Residual    41.88        
## 
## Fixed effects:  scale_linear_piece ~ timepoint 
##             Value Std.Error   DF t-value p-value
## (Intercept)   511     2.091 4999   244.2       0
## timepoint      17     0.424 4999    39.6       0
##  Correlation: 
##           (Intr)
## timepoint 0.241 
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -6.1165 -0.3756  0.0175  0.4280  4.3806 
## 
## Number of Observations: 6000
## Number of Groups: 1000
anova(tk07.growth.model.1, tk07.growth.model.2, tk07.growth.model.3)
##                     Model df   AIC   BIC logLik   Test L.Ratio p-value
## tk07.growth.model.1     1  3 71148 71168 -35571                       
## tk07.growth.model.2     2  4 65702 65729 -32847 1 vs 2    5448  <.0001
## tk07.growth.model.3     3  6 65086 65127 -32537 2 vs 3     619  <.0001

The ANOVA comparing all of our models indicates that our last model, with Random Slopes and Intercepts, is significantly better than our Random Intercepts model, which is significantly better than our Fixed model.

Visualize Growth Model

Finally, let’s visualize the growth model. Note the random effects from the lme() models are effects beyond the fixed effects (so an individual’s actual intercept is fixed_intercept + random_intercept; same w/ the slope.)

fixed_intercept <- tk07.growth.model.3$coefficients$fixed[1]
fixed_slope <- tk07.growth.model.3$coefficients$fixed[2]
tk07.growth.predictions <- tk07.growth.model.3$coefficients$random %>%
  data.frame() %>%
  rename_with( ~ c("random_intercept","random_slope")) %>%
  mutate(g3 = fixed_intercept + random_intercept + 0 * (fixed_slope + random_slope),
         g4 = fixed_intercept + random_intercept + 1 * (fixed_slope + random_slope),
         g5 = fixed_intercept + random_intercept + 2 * (fixed_slope + random_slope),
         g6 = fixed_intercept + random_intercept + 3 * (fixed_slope + random_slope),
         g7 = fixed_intercept + random_intercept + 4 * (fixed_slope + random_slope),
         g8 = fixed_intercept + random_intercept + 5 * (fixed_slope + random_slope)) %>%
  pivot_longer(cols = c(g3, g4, g5, g6, g7, g8), 
               names_to = "timepoint", 
               values_to = "predicted_score") %>%
  select(timepoint, predicted_score) %>%
  mutate(student = sort(rep(1:1000,6)))

tk07.growth.predictions %>%
  ggplot(., aes(x = timepoint, y = predicted_score, group = student, color = as.factor(student))) +
  geom_line(alpha = 0.2) +
  labs(
    title = "Model-Estimated Growth Trajectories",
    x = "Timepoint",
    y = "Predicted Scaled Score"
  ) +
  theme_clean() +
  theme(legend.position = "none")

This visualization presents the ability trajectories we would expect from our simulated data; they are similar to the trajectories shown in the True (Latent) Ability section: a general increase in scores for most students, but with increased variability over time.

Summary

This vertical scaling project demonstrates how to implement and evaluate vertical scaling techniques for educational assessments. The document thoroughly covers the application of the plink package to conduct IRT-based common-item vertical scaling across multiple groups and timepoints, illustrating the process with two main examples: a simpler two-group scenario and a more complex six-grade longitudinal study.

Key achievements of this project include:

  • Implementing IRT common-item vertical scaling for a two-group design (Example 1) and a six-grade design (Example 2)
  • Simulating multi-group and longitudinal response data
  • Evaluating the vertical scaling results against the true (simulated) abilities
  • Comparing different scaling techniques for score reporting
  • Conducting simple growth modeling on the longitudinal scaled scores

Overall, this document serves as a comprehensive guide to vertical scaling within the context of educational measurement, offering valuable insights and methodologies that can be applied to similar assessment scenarios. It not only enhances understanding of vertical scaling but also demonstrates the practical application of theoretical concepts using R, making it a valuable resource for both educational researchers and psychometricians.

Kolen, Michael J., and Robert L. Brennan. 2014. Test Equating, Scaling, and Linking: Methods and Practices. 3rd ed. New York: Springer.
Tong, Ye, and Michael J. Kolen. 2007. “Comparisons of Methodologies and Results in Vertical Scaling for Educational Achievement Tests.” Applied Measurement in Education 20 (2): 227–53.
Weeks, Jonathan P. 2010. “plink: An R Package for Linking Mixed-Format Tests Using IRT-Based Methods.” Journal of Statistical Software 35 (12): 1–33. http://www.jstatsoft.org/v35/i12/.
Young, Michael J., and Ye Tong. 2015. “Vertical Scales.” In Handbook of Test Development, 450–66. Routledge.